Introduction

The purpose of this project is to create a tool, similar to Zillow’s Zestimate, that can predict unknown home prices from a mixture of internal, external, and spatial variables. An accurate tool gives home buyers and owners a better understanding of the value of their homes, while also helping governments estimate appropriate property tax rates. The lack of openly available government data for this area adds a particular level of difficulty to building the model. Additionally, areas in the western portion of the county exhibit a stark difference in spatial commonality that makes the model more difficult to apply there than on the eastern side. Our strategy is to gather the best available data, run it through a series of analyses, and transform it into the most suitable format. An algorithm then guides us to the optimal variable combination, which is used throughout the rest of the modeling. In the final model, home sale price is a function of the home’s physical characteristics, school attendance choice, distance to the designated high school, accessibility to parks and trails, location within historic districts, and exposure to crime. The resulting model predicts ~80% of the variation in home prices, with an error rate of only ~13% when testing the model’s generalizability.

# for Lan
setwd("/Users/lexi/Desktop/UPENN_COURSE/MUSA_508/assignment/assignment2/MUSA508_house_price_prediction")

# for Austin
#setwd("C:/Users/austi/Desktop/Penn/Classes/PublicPolicy/Labs/Lab2/MUSA508_house_price_prediction")
# color set up
# yellow to green
palette1 <- c('#f2d6a2', '#dfd688', '#ccd270', '#b8ca58', '#a4bf41')  # lighter one
palette2 <- c('#ffd358', '#e1cf5b', '#b9bc4e', '#879a32', '#4f6d00')   # stronger one
palette3 <- c('#f2d6a2', '#e9d596', '#dfd38a', '#d6d17e', '#ccce72', '#c2cb66', '#b8c85a', '#aec44d', '#a4bf41')  # lighter one with more gradient
# green to red
palette4 <- c('#1a7328', '#61a262', '#f25d50', '#f26a4b', '#f2b4ae')
# light blue to dark red
palette5 <- c('#dcecff', '#d7c1de', '#dfb4cd', '#e3a8bf', '#e49db3', '#e491a8', '#e2869e', '#e07a95', '#de6f8c', '#da6383', '#d5597c', '#cf4f74', '#c7466c', '#c03c63', '#b8335a', '#b02a50', '#a72146', '#9f183c', '#950e32', '#8c0327')
palette6 <- c('#d4a2af', '#872f3b')
palette7 <- c('#dcecff', '#ddb5c9', '#cf8296', '#b25367', '#872f3b')

Data Wrangling

We utilized several methods to gather the data used in this project. Our main source was webscraping the Boulder County Open Data Site, from which we obtained several datasets, including parks and trailheads. As the City of Boulder is a separate entity within the county, we webscraped historic districts from the City of Boulder Open Data Site. Crime data was not openly available from the county or city at all, nor available at a useful geographic level from state or national sources, so we contacted the Boulder County Sheriff’s Office, who provided the data for a fee. We also used national sources including the FGDC, the Centers for Medicare & Medicaid Services, and the Census Bureau (from whom we derived our definition of neighborhoods for this use case).

Data Loading and Cleaning

After bringing in our datasets from their various sources, we cleaned and manipulated them to ensure they included only what was necessary for the analysis and would run through the rest of the project without issue. Depending on the type of data, we used different methods to measure each property’s exposure to external variables such as those mentioned above.

load property-information data

physical_data <- st_read("data/studentData.geojson") %>% 
  st_set_crs('ESRI:102254') %>% 
  filter(price < 30000000)

replace NA in AcDscr and Ac with "without"

physical_data$AcDscr[is.na(physical_data$AcDscr)] <- "without"
physical_data$Ac[is.na(physical_data$Ac)] <- "without"

load Boulder city boundary

city_boundary <- st_read("https://opendata.arcgis.com/datasets/955e7a0f52474b60a9866950daf10acb_0.geojson") %>% 
  st_transform('ESRI:102254') %>% 
  st_union() %>% 
  st_buffer(1) %>%
  st_sf()

load Boulder county boundary

county_boundary <- st_read("https://opendata.arcgis.com/datasets/964b8f3b3dbe401bb28d49ac93d29dc4_0.geojson") %>% 
  st_transform('ESRI:102254') 

School

load and clean school attendance area data

# load elementary attendance area data 
# BVSD school district 
area.BVSD.elementary <- st_read("data/BVSD_elmtr_area.geojson") %>% 
  st_transform('ESRI:102254') %>% 
    dplyr::select(SchName, State_CD)
area.BVSD.elementary$SchName <- toupper(area.BVSD.elementary$SchName)
# merge areas that have 2 school options
# only FID 1
area.FID1 <- st_union(
    area.BVSD.elementary %>% 
    filter(SchName  == "CREEKSIDE"),
  area.BVSD.elementary %>% 
    filter(SchName == "BC-CREEKSIDE  ELEMENTARY SCHOOL")
) %>% 
    dplyr::select(SchName, State_CD)
area.FID2 <- st_union(
    area.BVSD.elementary %>% 
    filter(SchName  == "MESA"),
  area.BVSD.elementary %>% 
    filter(SchName == "BC-MESA ELEMENTARY SCHOOL")
) %>% 
    dplyr::select(SchName, State_CD)
# rbind back
area.BVSD.elementary <- rbind(
  area.FID1,
  area.FID2,
  area.BVSD.elementary %>% 
    dplyr::filter(State_CD != "0") %>% 
    dplyr::filter(SchName != "MESA" & SchName != "CREEKSIDE") %>% 
    dplyr::select(SchName, State_CD)
) 
# St Vrain Valley School District 
area.SVV.elementary <- st_read("data/SVV_elmtr_area.geojson") %>% 
  st_transform('ESRI:102254') %>% 
    dplyr::select(SchName)
area.SVV.elementary$SchName <- toupper(area.SVV.elementary$SchName)
# Estes Park & Thompson
area.other.elementary <- st_read("data/SABS_1516_SchoolLevels/SABS_1516_Primary.shp") %>% 
  st_transform('ESRI:102254') %>% 
  filter(leaid == "0803810" | leaid == "0805400")%>%
  dplyr::select(schnam)
area.other.elementary <- area.other.elementary[county_boundary,]
area.other.elementary$schnam <- toupper(area.other.elementary$schnam)

# load high school attendance area data 
# BVSD District 
area.BVSD.high <- st_read("data/BVSD_hsc_area.geojson") %>% 
  st_transform('ESRI:102254') %>% 
  dplyr::select(SchName, State_CD, FID)
area.BVSD.high$SchName <- toupper(area.BVSD.high$SchName)
# merge areas that have 2 school options
# only FID 1
area.FID1 <- st_union(
  area.BVSD.high %>% 
    filter(FID == "1"),
  area.BVSD.high %>% 
    filter(FID == "2")
) %>% 
  dplyr::select(SchName, State_CD)
# rbind FID 1 back
area.BVSD.high <- rbind(
  area.FID1,
  area.BVSD.high %>% 
    dplyr::filter(FID != "2" & FID != "1") %>% 
    dplyr::select(SchName, State_CD)
) 
# St Vrain Valley School District 
area.SVV.high <- st_read("data/SVV_hsc_area.geojson") %>% 
  st_transform('ESRI:102254') %>% 
    dplyr::select(SchName)
area.SVV.high$SchName <- toupper(area.SVV.high$SchName)
# Estes Park & Thompson
area.other.high <- st_read("data/SABS_1516_SchoolLevels/SABS_1516_High.shp") %>% 
  st_transform('ESRI:102254') %>% 
  filter(leaid == "0803810" | leaid == "0805400")%>%
  dplyr::select(schnam)
area.other.high <- area.other.high[county_boundary,]
area.other.high$schnam <- toupper(area.other.high$schnam)

load school data

# load school data
schools <- st_read("data/Public_Schools.geojson") %>% 
  st_transform('ESRI:102254')

select schools that are operational and regular

# select schools that are operational and regular
schools <- rbind(
  schools %>% 
    dplyr::filter(NAME == "BROOMFIELD HIGH SCHOOL"), # This school isn't within Boulder, but some Boulder residents should attend it
  schools %>% 
    # select regular schools
    dplyr::filter(TYPE == "1") %>% 
    # select operational schools
    dplyr::filter(STATUS == "1") %>% 
    dplyr::filter(NAME != "NEW VISTA HIGH SCHOOL")  # This school isn't in school district list
)
schools <- schools[st_buffer(county_boundary,10000),]

load school rating data

# load and filter by district
school.ratings <- read_csv("data/SPF_Final Ratings_2010-201.csv")

# transform names to uppercase
school.ratings$Name <- toupper(x=school.ratings$Name)

combine school area with school rating data

# join BVSD high
school.high <- right_join(
  school.ratings,
  area.BVSD.high,
  by = c("SchoolNumber" = "State_CD")
) %>% 
  dplyr::select(DistrictName, Name, Total_Percent_Points, geometry)
# join SVV high
school.high <- rbind(
  right_join(
  school.ratings,
  area.SVV.high,
  by = c("Name" = "SchName"))%>% 
    dplyr::select(DistrictName, Name, Total_Percent_Points, geometry),
  school.high
)
# join Estes Park & Thompson high
school.high <- rbind(
  right_join(
  school.ratings,
  area.other.high,
  by = c("Name" = "schnam"))%>% 
    dplyr::select(DistrictName, Name, Total_Percent_Points, geometry),
  school.high
)
# trim within Boulder
school.high <- st_sf(school.high)
school.high <- st_intersection(school.high, county_boundary) %>% 
  dplyr::select(DistrictName, Name, Total_Percent_Points, geometry)

# join BVSD elementary
school.elementary <- right_join(
  school.ratings,
  area.BVSD.elementary,
  by = c("SchoolNumber" = "State_CD")
)%>% 
    dplyr::select(DistrictName, Name, Total_Percent_Points, geometry)
# join SVV elementary
school.elementary <- rbind(
  right_join(
  school.ratings,
  area.SVV.elementary,
  by = c("Name" = "SchName"))%>% 
    filter(DistrictName == "St Vrain Valley RE1J") %>% 
    dplyr::select(DistrictName, Name, Total_Percent_Points, geometry),
  school.elementary
)
# join Estes Park & Thompson elementary
school.elementary <- rbind(
  right_join(
  school.ratings,
  area.other.elementary,
  by = c("Name" = "schnam"))%>%
    dplyr::select(DistrictName, Name, Total_Percent_Points, geometry),
  school.elementary
)
# trim within Boulder
school.elementary <- st_sf(school.elementary)
school.elementary <- st_intersection(school.elementary, county_boundary)%>% 
  dplyr::select(DistrictName, Name, Total_Percent_Points, geometry)

combine school point with school rating data

# high school
school.high.point <- school.high %>% 
  st_drop_geometry() %>% 
  left_join(
    schools,
    by = c("Name" = "NAME")
  ) %>% 
  st_sf() %>% 
  filter(COUNTYFIPS %in% c("08013", "08014", "08069", "08123") |  is.na(COUNTYFIPS)) %>% 
  dplyr::select(DistrictName, Name) %>% 
  distinct()

# add geometry for one high school 
# make the geometry
geometry <- st_point(x = c(-105.49524267430837,40.37119690801272), dim="xy") %>% 
  st_sfc(crs = 4326) 
geometry <- st_sf(DistrictName = "Estes Park R-3",Name = "ESTES PARK HIGH SCHOOL", geometry) %>% 
  st_transform('ESRI:102254')
# combine with high school data
school.high.point <- school.high.point %>% 
  filter(Name != "ESTES PARK HIGH SCHOOL") %>% 
  rbind(geometry)

# elementary school
school.elementary.point <- school.elementary %>% 
  st_drop_geometry() %>% 
  left_join(
    schools,
    by = c("Name" = "NAME")
  ) %>% 
  st_sf() %>% 
  filter(COUNTYFIPS %in% c("08013", "08014", "08059", "08069", "08123") |  is.na(COUNTYFIPS)) %>% 
  # remove those school with the same name
  filter(OBJECTID != "77522" | is.na(OBJECTID)) %>% 
  filter(OBJECTID != "60157" | is.na(OBJECTID)) %>%
  filter(DistrictName != "St Vrain Valley RE1J" | CITY != "BOULDER" | is.na(CITY)) %>% 
  filter(DistrictName != "Boulder Valley Re 2" | CITY != "LONGMONT" | is.na(CITY)) %>% 
  filter(Name != "MOUNTAIN VIEW ELEMENTARY SCHOOL" | CITY == "LONGMONT" | is.na(CITY)) %>% 
  dplyr::select(DistrictName,Name) %>% 
  distinct()

# add geometry for three elementary schools 
# make the geometry
geometry <- st_point(x = c(-105.49774609253775, 40.36845070512792), dim="xy") %>% 
  st_sfc(crs = 4326) 
geometry2 <- st_sf(DistrictName = "Estes Park R-3",Name = "ESTES PARK K-5 SCHOOL", geometry) %>% 
  st_transform('ESRI:102254')
geometry <- st_point(x = c(-104.92971708817917, 40.50780548561505), dim="xy") %>% 
  st_sfc(crs = 4326) 
geometry3 <- st_sf(DistrictName = "St Vrain Valley RE1J",Name = "GRAND VIEW ELEMENTARY", geometry) %>% 
  st_transform('ESRI:102254')
geometry <- st_point(x = c(-105.08177294592369, 40.07535419035197), dim="xy") %>% 
  st_sfc(crs = 4326) 
geometry4 <- st_sf(DistrictName = "Boulder Valley Re 2",Name = "MEADOWLARK SCHOOL", geometry) %>% 
  st_transform('ESRI:102254')
# combine with elementary school data
school.elementary.point <- school.elementary.point %>% 
  filter(Name != "ESTES PARK K-5 SCHOOL" & Name != "GRAND VIEW ELEMENTARY" & Name != "MEADOWLARK SCHOOL") %>% 
  rbind(geometry2) %>% 
  rbind(geometry3) %>% 
  rbind(geometry4)

Hospital

load hospital data

# load and select field
hospitals <- st_read("data/hOSPITALS.geojson") %>% 
  st_transform('ESRI:102254')

Census Data: block groups & unemployment

# load unemployment data
census <-  
  get_acs(geography = "block group", variables = c("B23025_003E","B23025_005E"), 
                year=2019, state=08, county=013, geometry=T) %>% 
  st_transform('ESRI:102254')
# transform the format
census <- 
  census %>%
  dplyr::select( -NAME, -moe) %>%
  spread(key = variable, value = estimate) %>%
  rename(TotalLabor = B23025_003, 
         Unemployment = B23025_005
         ) %>% 
  mutate(UnemploymentRate = Unemployment / TotalLabor) %>% 
  dplyr::select(GEOID, UnemploymentRate, geometry)

census$GEOID <- as.character(census$GEOID)

trim hospitals to those within Boulder County, and select hospitals that are general and open

# trim within Boulder
hospitals <- hospitals[county_boundary,] %>% 
  dplyr::filter(COUNTY == 'BOULDER') %>% 
  #select open hospitals
  dplyr::filter(STATUS == "OPEN") %>% 
  # select GENERAL MEDICAL AND SURGICAL HOSPITALS
  dplyr::filter(NAICS_DESC == "GENERAL MEDICAL AND SURGICAL HOSPITALS") %>% 
  # select variables
  dplyr::select(NAME, CITY, TRAUMA)

Historic Districts

#City of Boulder Historic Districts
histDists <- st_read("https://opendata.arcgis.com/datasets/643848efa2e24bdf8a8eafba67ec3d43_1.geojson") %>%
  st_transform('ESRI:102254')

# preview sales that fall inside a City of Boulder historic district
# (result is not assigned; the in_hist flag is engineered later via st_join)
physical_data[histDists,] %>%
  dplyr::mutate(Historic = "Yes")

Parks

# County Parks
countyParks <- st_read("https://opendata.arcgis.com/datasets/61728921abcb481fa98b9b07cfd7c95d_0.geojson") %>%
  st_transform('ESRI:102254') %>%
  filter(PARK_GROUP != "NA") %>%
  dplyr::select("PROP_NAME")

Crime

crime <- read.csv("data/Crime_Data.csv")

# preview the five most common incident types
group_by(crime, CAD.Problem) %>%
  summarize(count = n()) %>%
  arrange(-count) %>% top_n(5)

# keep only assaults with valid coordinates, project, and de-duplicate
crime.sf <-
  crime %>%
  filter(CAD.Problem == "ASSAUS-Assault",
         Latitude > -1) %>%
  dplyr::select(Latitude, Longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102254') %>%
  distinct()

City Parks

cityPark <- st_read("https://opendata.arcgis.com/datasets/fdfc0410e75c48b6b9e53cf94bdbe224_1.geojson") %>%
  st_transform('ESRI:102254') %>%
  filter(PARKTYPE != "Undeveloped") %>%
  dplyr::rename(PROP_NAME=NAME)

# keep only PROP_NAME (plus geometry) so city parks match countyParks
cityPark_scrubbed <- cityPark %>% 
  dplyr::select(PROP_NAME)
#County + City Parks
allParks <- rbind(countyParks, cityPark_scrubbed)
#buffer on each house, how many parks it intersects

Trailheads

trailheads <- st_read("https://opendata.arcgis.com/datasets/3a950053bbef46c6a3c2abe3aceee3de_0.geojson") %>%
  st_transform('ESRI:102254')
#buffer on each house, how many points fall within

Feature Engineering

sum within census tract (least optimal for most situations)

sum within a fixed buffer (assumes the scale relationship is uniform countywide)

average distance to the k-nearest cases

Internal data

# convert coded columns to character so they are treated as categorical
physical_data <- physical_data %>%
  dplyr::mutate(
    dplyr::across(
      c(section_num, designCode, ConstCode, bsmtType,
        carStorageType, Heating, ExtWallPrim, IntWall),
      as.character
    )
  )

School

# combine with high school
sale <- physical_data %>% 
  st_join(school.high)
# rename column
sale <- rename(sale, c( "high_school" = "Name","high_school_points" = "Total_Percent_Points"))

# combine with elementary school
sale <- sale %>% 
  st_join(school.elementary)
# rename column
sale <- rename(sale, c( "elmtr_school" = "Name","elmtr_school_points" = "Total_Percent_Points"))

combine school distance with sale data

# add distance to high school
# make a df contain school geometry with order
school.high.order <- sale %>% 
  st_drop_geometry() %>% 
  left_join(
    school.high.point,
    by = c("high_school" = "Name")
    ) %>% 
  st_sf() %>% 
  dplyr::select(MUSA_ID)
# calculate distance
sale <- sale %>% 
  dplyr::mutate(
    hsch_distance = st_distance(sale, school.high.order, by_element = TRUE)
  )

# add distance to elementary school
# make a df contain school geometry with order
school.elementary.order <- sale %>% 
  st_drop_geometry() %>% 
  left_join(
    school.elementary.point,
    by = c("elmtr_school" = "Name", "DistrictName.y" = "DistrictName")
    ) %>% 
  st_sf() %>% 
  dplyr::select(MUSA_ID)
# calculate distance
sale <- sale %>% 
  dplyr::mutate(
    elmtr_distance = st_distance(sale, school.elementary.order, by_element = TRUE)
  )

Hospital

extract city from address

# extract address and replace NA
address <- sale[5] %>% 
  st_drop_geometry()
address$address[is.na(sale$address)] <- "NA NA"
# create a data frame to hold the city
city <- data.frame(city = "1")
# extract the city (second-to-last token) from each address
for (i in seq_len(nrow(address))) {
  z <- strsplit(address[i,], split = " ")[[1]]
  city[i,] <- z[length(z)-1]
}
# combine with sale data
sale <- sale %>% 
  cbind(city)

sale$BoulderCity[sale$city == "BOULDER"] <- "In"
sale$BoulderCity[is.na(sale$BoulderCity)] <- "NotIn"
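The loop above can also be written vectorized; a base-R sketch, assuming each address ends with "... CITY STATE" so the city is the second-to-last token (`extract_city` is a hypothetical helper, not part of the project code):

```r
# vectorized city extraction: split each address on spaces and take
# the second-to-last token (illustrative stand-in for the loop above)
extract_city <- function(addr) {
  parts <- strsplit(addr, " ")
  vapply(parts, function(z) z[length(z) - 1], character(1))
}

extract_city(c("123 MAIN ST BOULDER CO", "9 ELM AVE LONGMONT CO"))
# → "BOULDER" "LONGMONT"
```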

combine hospital level with sale data

# left join, rename column 
sale <- sale %>% 
  left_join(
    hospitals %>% 
      st_drop_geometry(),
    by = c("city" = "CITY")
  ) %>% 
  dplyr::select(-NAME) %>% 
  rename(c("hospital_level" = "TRAUMA"))
#replace NA with no_hospital
sale$hospital_level[is.na(sale$hospital_level)] <- "no_hospitals"

Census Data: block groups & unemployment

sale <- sale %>% 
  st_join(
    y = census,
    join = st_within,
    left = TRUE
  )

Parks

sale2 <- physical_data
#FE - Parks#

allParks <- allParks %>% 
  mutate(counter = 1)

# count parks within a 3/8-mile (≈603 m) buffer of each house
sale2$parkCount =
  st_buffer(sale2, 603) %>%
  aggregate(dplyr::select(allParks, counter), ., sum) %>%
  pull(counter)

sale2$parkCount[ is.na(sale2$parkCount)] <- 0

Crimes

#FE Crime

st_c <- st_coordinates

sale2 <-
  sale2 %>% 
  mutate(
    crime_nn1 = nn_function(st_c(sale2), st_c(crime.sf), 1),
    crime_nn2 = nn_function(st_c(sale2), st_c(crime.sf), 2), 
    crime_nn3 = nn_function(st_c(sale2), st_c(crime.sf), 3), 
    crime_nn4 = nn_function(st_c(sale2), st_c(crime.sf), 4), 
    crime_nn5 = nn_function(st_c(sale2), st_c(crime.sf), 5),
    crimes.Buffer =
      st_buffer(sale2, 1320) %>%
      aggregate(mutate(crime.sf, counter = 1),., sum) %>%
      pull(counter)
    ) 

sale2$crimes.Buffer[is.na(sale2$crimes.Buffer)] <- 0
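The `nn_function()` helper above is assumed to come from the course's utility functions; `nn_sketch` below is a hypothetical base-R stand-in that shows the computation it performs — the mean Euclidean distance from each house to its k nearest crime points:

```r
# for each row of coords (houses), compute the mean Euclidean distance
# to its k nearest rows of coords2 (crimes); illustrative stand-in only
nn_sketch <- function(coords, coords2, k) {
  apply(coords, 1, function(pt) {
    d <- sqrt((coords2[, 1] - pt[1])^2 + (coords2[, 2] - pt[2])^2)
    mean(sort(d)[seq_len(k)])
  })
}

houses <- matrix(c(0, 0,
                   10, 10), ncol = 2, byrow = TRUE)
crimes <- matrix(c(0, 3,
                   0, 4,
                   10, 10), ncol = 2, byrow = TRUE)
nn_sketch(houses, crimes, k = 2)  # → 3.50 and ≈5.83
```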

Historic District

#FE Historic Districts

sale2 <-  st_join(
  x=sale2, 
  y=histDists, 
  join = st_within, 
  left = TRUE
)

sale2.inhist <- sale2 %>% 
  filter(!is.na(OBJECTID)) %>% 
  mutate(in_hist = "Yes") 
sale2.nohist <- sale2 %>% 
  filter(is.na(OBJECTID)) %>% 
  mutate(in_hist = "No") 
sale2 <- rbind(sale2.nohist,sale2.inhist)

sale2 <- sale2 %>%
  dplyr::select(-OBJECTID,-NAME,-DESIGNATION,-SHAPESTArea,-SHAPESTLength)

Trailheads

#FE - Trailheads

trailheads <- trailheads %>% 
  mutate(counter = 1)

# count trailheads within a 1-mile (≈1608 m) buffer of each house
sale2$trailheadCount =
  st_buffer(sale2, 1608) %>%
  aggregate(dplyr::select(trailheads, counter), ., sum) %>%
  pull(counter)

sale2$trailheadCount[ is.na(sale2$trailheadCount)] <- 0

trailheads_boulder <- st_intersection(county_boundary, trailheads)

Combine All

sale.final <- left_join(
  sale %>%
    st_drop_geometry(),
  sale2 %>%
    dplyr::select(MUSA_ID, parkCount, crime_nn1, crime_nn2, crime_nn3, crime_nn4, crime_nn5, crimes.Buffer, in_hist, geometry, trailheadCount)
  ) %>%
  st_as_sf() %>%
  dplyr::select(-UnitCount, -Stories, -Roof_Cover, -ExtWallSec, -Ac)

More Feature Engineering: Log transformation

check correlations and variables’ distributions

ggpairs(
  select_if(
    st_drop_geometry(
      sale.final %>% 
        dplyr::select(price, year_quarter, section_num, designCode, qualityCodeDscr,
                      ConstCode, ConstCodeDscr, builtYear, CompCode, EffectiveYear,
                      bsmtSF, bsmtType, carStorageType, mainfloorSF, nbrThreeQtrBaths,
                      nbrFullBaths, nbrHalfBaths, TotalFinishedSF, AcDscr, HeatingDscr,
                      ExtWallDscrPrim, ExtWallDscrSec, Roof_CoverDscr, high_school,
                      elmtr_school, city, GEOID, crime_nn1, crime_nn3, crime_nn4,
                      in_hist, trailheadCount)
    ),
    is.numeric
  ) %>% 
    na.omit(),
  lower = list(continuous = wrap("points", shape = I('.'))),
  upper = list(combo = wrap("box", outlier.shape = I('.')))
)

log transform

# replace NA prices with 0 so the log transform below is defined
sale.final$price[is.na(sale.final$price)] <- 0
sale.final <-
    sale.final %>%
    mutate(LNprice = log(1+price),
                 LNbsmtSF = log(1+bsmtSF),
                 LNmainfloorSF = log(1+mainfloorSF),
                 LNnbrThreeQtrBaths = log(1+nbrThreeQtrBaths),
                 LNnbrFullBaths = log(1+nbrFullBaths),
                 LNnbrHalfBaths = log(1+nbrHalfBaths),
                 LNTotalFinishedSF = log(1+TotalFinishedSF),
                 LNcrime_nn1 = log(crime_nn1),
                 LNcrime_nn2 = log(crime_nn2),
                 LNcrime_nn3 = log(crime_nn3),
                 LNcrime_nn4 = log(crime_nn4),
                 LNcrime_nn5 = log(crime_nn5),
                 LNtrailheadCount = log(1+trailheadCount),
                 LNparkCount = log(1+parkCount),
                 LNhsch_distance = log(hsch_distance)
                 )
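Note that `log(1 + x)` above is the log1p transform, chosen because several variables contain zeros, where a plain `log()` would return `-Inf`. Since the model is later fit on `LNprice`, predictions must be back-transformed before reporting dollar amounts; a quick base-R illustration:

```r
# log1p() is log(1 + x) and is defined at zero; expm1() inverts it
# exactly, recovering the original dollar values
price <- c(0, 453000, 833000)
ln_price <- log1p(price)     # identical to log(1 + price)
expm1(ln_price)              # → 0 453000 833000
```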

Exploratory Analysis

Summary Statistics

stargazer(
  select_if(st_drop_geometry(sale.final) %>% filter(toPredict == 0), is.numeric) %>% na.omit() %>% drop_units(), 
  title = "Table 1. Descriptive statistics",
  type = "text",
  digits = 1
  )
## 
## Table 1. Descriptive statistics
## ==================================================================================
## Statistic             N      Mean    St. Dev.    Min   Pctl(25) Pctl(75)    Max   
## ----------------------------------------------------------------------------------
## price               11,249 746,777.5 544,084.7 10,000  453,000  833,000  7,350,000
## year                11,249  2,019.8     0.7     2,019   2,019    2,020     2,021  
## bld_num             11,249    1.0       0.1       1       1        1         9    
## qualityCode         11,249   37.3       7.9      10       30       41       80    
## bldgClass           11,249  1,212.0     0.0     1,212   1,212    1,212     1,212  
## builtYear           11,249  1,987.1    26.8     1,860   1,972    2,006     2,021  
## CompCode            11,249    1.0       0.1       0       1        1         1    
## EffectiveYear       11,249  1,997.7    17.5     1,879   1,989    2,010     2,021  
## bsmtSF              11,249   745.0     627.5      0      143     1,126     4,534  
## carStorageSF        11,249   471.6     235.9      0      380      600      3,040  
## nbrBedRoom          11,249    3.4       1.1       0       3        4        24    
## nbrRoomsNobath      11,249    7.4       2.1       0       6        9        28    
## mainfloorSF         11,249  1,313.7    607.6      0      924     1,619     7,795  
## nbrThreeQtrBaths    11,249    0.6       0.7       0       0        1         9    
## nbrFullBaths        11,249    1.8       0.9       0       1        2        12    
## nbrHalfBaths        11,249    0.6       0.5       0       0        1         3    
## TotalFinishedSF     11,249  1,954.7    892.1      0     1,276    2,443    10,188  
## MUSA_ID             11,249  5,680.1   3,280.7     1     2,834    8,520    11,364  
## toPredict           11,249    0.0       0.0       0       0        0         0    
## high_school_points  11,249   70.1      13.5     44.6     68.1     78.8     87.5   
## elmtr_school_points 11,249   66.6      14.0     37.3     57.2     73.0     100.0  
## hsch_distance       11,249  3,498.6   2,881.2   115.3  1,612.6  4,579.7  22,848.0 
## elmtr_distance      11,249  2,006.7   2,621.2   51.1    648.1   2,121.9  22,523.0 
## UnemploymentRate    11,249   0.04      0.03       0      0.02     0.1        0    
## parkCount           11,249    0.7       1.4       0       0        1        12    
## crime_nn1           11,249  3,580.9   2,508.7   23.3   1,746.8  4,819.6  22,362.5 
## crime_nn2           11,249  4,702.5   2,502.6   279.3  2,874.9  6,331.2  23,246.1 
## crime_nn3           11,249  5,349.9   2,635.4   643.4  3,368.3  7,275.4  24,321.5 
## crime_nn4           11,249  5,806.3   2,714.7  1,211.2 3,687.5  7,809.2  24,891.5 
## crime_nn5           11,249  6,207.0   2,795.4  1,579.2 4,031.2  8,277.1  25,595.5 
## crimes.Buffer       11,249    0.2       0.5       0       0        0         3    
## trailheadCount      11,249    2.9       4.1       0       1        4        29    
## LNprice             11,249   13.4       0.5      9.2     13.0     13.6     15.8   
## LNbsmtSF            11,249    5.2       2.9      0.0     5.0      7.0       8.4   
## LNmainfloorSF       11,249    7.1       0.5      0.0     6.8      7.4       9.0   
## LNnbrThreeQtrBaths  11,249    0.4       0.4       0       0       0.7        2    
## LNnbrFullBaths      11,249    1.0       0.3      0.0     0.7      1.1       2.6   
## LNnbrHalfBaths      11,249    0.4       0.4       0       0       0.7        1    
## LNTotalFinishedSF   11,249    7.5       0.5      0.0     7.2      7.8       9.2   
## LNcrime_nn1         11,249    7.9       0.8      3.2     7.5      8.5      10.0   
## LNcrime_nn2         11,249    8.3       0.6      5.6     8.0      8.8      10.1   
## LNcrime_nn3         11,249    8.4       0.6      6.5     8.1      8.9      10.1   
## LNcrime_nn4         11,249    8.5       0.5      7.1     8.2      9.0      10.1   
## LNcrime_nn5         11,249    8.6       0.5      7.4     8.3      9.0      10.2   
## LNtrailheadCount    11,249    1.0       0.8      0.0     0.7      1.6       3.4   
## LNparkCount         11,249    0.3       0.6      0.0     0.0      0.7       2.6   
## LNhsch_distance     11,249    7.9       0.8      4.7     7.4      8.4      10.0   
## ----------------------------------------------------------------------------------

Correlation matrix

numericVars <- 
  select_if(st_drop_geometry(sale.final), is.numeric) %>% na.omit()

ggcorrplot(
  round(cor(numericVars), 1), 
  p.mat = cor_pmat(numericVars),
  colors = c("#205A8C", "white", palette5[15]),
  type="lower",
  insig = "blank") +  
    labs(title = "Correlation across numeric variables") 

Home Price Correlation Scatterplots of Interest

#Scatterplots
st_drop_geometry(sale.final) %>% 
  mutate(Age = 2021 - builtYear) %>%
  dplyr::select(price, trailheadCount, LNcrime_nn5, LNhsch_distance, parkCount) %>%
  filter(price <= 1000000) %>%
  gather(Variable, Value, -price) %>% 
   ggplot(aes(Value, price)) +
     geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = palette5[10],size = 1.5) +
     facet_wrap(~Variable, ncol = 2, scales = "free") +
     labs(title = "Price as a function of continuous variables") +
     plotTheme()

Maps of Interesting Independent Variables

#Map of Dependent Variable - Sale Price
ggplot() +
  geom_sf(data = school.high, fill = "grey40") +
  geom_sf(data = sale.final, aes(color = q5(price)),size = 1) +
  scale_color_manual(values = palette7[1:5],
                       labels = qBr(sale.final, "price", 0),
                       name = "Sale Price\n(Quintile Breaks)") +
  labs(title = "Home Sales Prices", subtitle = "Boulder County, CO") +
  mapTheme()
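The `q5()` and `qBr()` helpers in these maps are assumed to come from the course's mapping utilities; `q5_sketch` below is a hypothetical base-R stand-in showing the quintile binning they perform:

```r
# cut a numeric variable into five quintile bins (illustrative stand-in
# for the course's q5() helper, not the actual implementation)
q5_sketch <- function(x) {
  cut(x,
      breaks = quantile(x, probs = seq(0, 1, by = 0.2), na.rm = TRUE),
      include.lowest = TRUE, labels = 1:5)
}

table(q5_sketch(1:100))  # → 20 observations in each of the 5 bins
```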

#Map of Independent Variable - Crime
ggplot() +
  geom_sf(data = school.high, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(crime.sf)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = palette5[1], high = palette5[20],
                      breaks=c(0.000000003,0.00000003),
                      labels=c("Minimum","Maximum"), name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = "none") +
  labs(title = "Density of Aggravated Assaults, Boulder County, CO") +
  mapTheme()

#Map of Independent Variable - High School Distance
ggplot() +
  geom_sf(data = school.high, fill = "grey40") +
  geom_sf(data = sale.final %>% na.omit(cols="hsch_distance"), aes(color = q5(hsch_distance)),size = 1) +
  scale_color_manual(values = palette7[1:5],
                    labels = qBr(sale.final %>% na.omit(cols="hsch_distance"), "hsch_distance", 0),
                    name = "Distance to High School\n(Quintile Breaks)") +
  labs(title = "Distance to Designated High School",
       subtitle = "Boulder County, CO") +
  mapTheme()

#Map of Independent Variable - elementary School Points
ggplot() +
  geom_sf(data = school.high, fill = "grey40") +
  geom_sf(data = sale.final %>% na.omit(cols="elmtr_school_points"), aes(color = q5(elmtr_school_points)),size = 1) +
  scale_color_manual(values = palette7[1:5],
                    labels = qBr(sale.final %>% na.omit(cols="elmtr_school_points"), "elmtr_school_points", 0),
                    name = "Elementary School Ratings\n(Quintile Breaks)") +
  labs(title = "Elementary School Points",
      subtitle = "Boulder County, CO") +
  mapTheme()

#Map of Independent Variable - Parks


ggplot() +
  geom_sf(data = school.high, fill = "grey40") +
  geom_sf(data = sale.final, aes(color = parkCount),size = 1) +
  scale_colour_gradient(low = palette5[1], high = palette5[20],
                      name = "Park Count") +
  labs(title = "Park Count within 3/8 Mile",
       subtitle = "Boulder County, CO") +
  mapTheme()

Additional Maps of Interest

#Map of Independent Variable - Trailheads
ggplot() +
  geom_sf(data = school.high, fill = "grey40") +
  geom_sf(data = trailheads_boulder, aes(color = DATAOWNER)) +
  scale_color_viridis(discrete = TRUE,
                    labels = distinct(trailheads_boulder %>% dplyr::select(DATAOWNER)),
                    name = "Trail Jurisdiction",
                    option = "C",
                    alpha = 0.5) +
  labs(title = "Trailheads in Boulder County, CO") +
  mapTheme()

Regression & Cross-Validation

The main goal of the regression was to formulate an algorithm, using the best internal and external variables available, to predict home prices in the area. To do so, we first split our data into two groups: a training set and a test set. The training set was used to “borrow” experience from past home transactions (actual known sale prices). The test set was used to check whether our equation was accurate. This process was repeated several times until we had the most accurate and generalizable algorithm possible.
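The test-set accuracy described here is typically summarized with mean absolute percent error (MAPE); a minimal sketch with toy numbers (illustrative only, not the project's actual results):

```r
# mean absolute percent error between observed and predicted sale prices
mape <- function(actual, predicted) {
  mean(abs(predicted - actual) / actual)
}

# toy stand-ins for test-set prices and model predictions
actual    <- c(450000, 620000, 815000)
predicted <- c(430000, 650000, 790000)
round(mape(actual, predicted), 3)  # → 0.041, i.e. ~4.1% error on this toy data
```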

split data into training and test sets

toModel.1 <- sale.final %>% 
  filter(toPredict == 0)



# set the seed before partitioning so the split is reproducible
set.seed(666)
inTrain <- createDataPartition(
             y = paste(toModel.1$qualityCode, toModel.1$GEOID, toModel.1$city,
                       toModel.1$mainfloorSF, toModel.1$HeatingDscr),
             p = .75, list = FALSE)

sale.training <- toModel.1[inTrain,] 
sale.test <- toModel.1[-inTrain,]  

regression models

model 1 - all variables

reg.training.1 <- lm(price ~ ., data = st_drop_geometry(sale.training) %>%
                       na.omit() %>% 
                       dplyr::select(price, year, year_quarter, bld_num, section_num,
                                     designCode, designCodeDscr, qualityCode, qualityCodeDscr,
                                     ConstCode, ConstCodeDscr, builtYear, CompCode, EffectiveYear,
                                     bsmtSF, bsmtType, bsmtTypeDscr, carStorageSF, carStorageType,
                                     carStorageTypeDscr, nbrBedRoom, nbrRoomsNobath, mainfloorSF,
                                     nbrThreeQtrBaths, nbrFullBaths, nbrHalfBaths, TotalFinishedSF,
                                     AcDscr, Heating, HeatingDscr, ExtWallPrim, ExtWallDscrPrim,
                                     ExtWallDscrSec, IntWall, IntWallDscr, Roof_CoverDscr,
                                     DistrictName.x, high_school, high_school_points, DistrictName.y,
                                     elmtr_school, elmtr_school_points, hsch_distance, elmtr_distance,
                                     city, hospital_level, GEOID, UnemploymentRate, parkCount,
                                     crime_nn1, crime_nn2, crime_nn3, crime_nn4, crime_nn5,
                                     crimes.Buffer, in_hist, trailheadCount))

model 2 - all variables, with selected variables log-transformed

reg.training.2 <- lm(LNprice ~ ., data = st_drop_geometry(sale.training) %>%
                       na.omit() %>% 
                       dplyr::select(LNprice, year, year_quarter, bld_num, section_num,
                                     designCode, designCodeDscr, qualityCode, qualityCodeDscr,
                                     ConstCode, ConstCodeDscr, builtYear, CompCode, EffectiveYear,
                                     LNbsmtSF, bsmtType, bsmtTypeDscr, carStorageSF, carStorageType,
                                     carStorageTypeDscr, nbrBedRoom, nbrRoomsNobath, LNmainfloorSF,
                                     LNnbrThreeQtrBaths, LNnbrFullBaths, LNnbrHalfBaths, LNTotalFinishedSF,
                                     AcDscr, Heating, HeatingDscr, ExtWallPrim, ExtWallDscrPrim,
                                     ExtWallDscrSec, IntWall, IntWallDscr, Roof_CoverDscr,
                                     DistrictName.x, high_school, high_school_points, DistrictName.y,
                                     elmtr_school, elmtr_school_points, hsch_distance, elmtr_distance,
                                     city, hospital_level, GEOID, UnemploymentRate, parkCount,
                                     LNcrime_nn1, LNcrime_nn2, LNcrime_nn3, LNcrime_nn4, LNcrime_nn5,
                                     crimes.Buffer, in_hist, LNtrailheadCount))

More Feature Engineering: collinearity (stepwise)

model 5 - variables selected by stepwise regression, excluding neighborhood

reg.training.5 <- lm(LNprice ~ year_quarter + bld_num + section_num + designCode + 
    qualityCodeDscr + ConstCode + builtYear + CompCode + EffectiveYear + 
    LNbsmtSF + bsmtType + carStorageSF + nbrBedRoom + nbrRoomsNobath + 
    LNmainfloorSF + LNnbrThreeQtrBaths + LNnbrFullBaths + LNnbrHalfBaths + 
    LNTotalFinishedSF + AcDscr + Heating + ExtWallPrim + ExtWallDscrSec + 
    IntWall + Roof_CoverDscr  + hsch_distance + 
    parkCount + LNcrime_nn1 + LNcrime_nn2 + LNcrime_nn3 + 
    LNcrime_nn4 + LNcrime_nn5 + LNtrailheadCount, 
    data = st_drop_geometry(sale.training))
#MAPE

reg.test5 <-
  sale.test %>%
  mutate(price.Predict = exp(predict(reg.training.5, sale.test))-1,
         price.Error = price.Predict - price,
         price.AbsError = abs(price.Predict - price),
         price.APE = (abs(price.Predict - price)) / price.Predict,
         Regression = "Baseline Regression") %>% 
  na.omit(cols='price.Predict') 

error.5 <- data.frame(MAE = mean(reg.test5$price.AbsError, na.rm = T), MAPE = mean(reg.test5$price.APE, na.rm = T))

kable(error.5) %>%
  kable_styling() %>%
  footnote(general_title = "\n",
           general = " 4")
MAE        MAPE
71274.82   0.1222232

(Table 4)
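For reference, the two metrics reported above follow the definitions used throughout this analysis (MAPE here is taken relative to the predicted price, as in the code). A toy computation with made-up numbers:

```r
# Hypothetical observed and predicted sale prices (not from the data)
observed  <- c(400000, 550000, 300000)
predicted <- c(420000, 500000, 330000)

MAE  <- mean(abs(predicted - observed))              # 33333.33 dollars
MAPE <- mean(abs(predicted - observed) / predicted)  # ~0.0795, i.e. ~8%
```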
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)
# cross-validation
reg.cv.5 <- 
  train(LNprice ~ year_quarter + bld_num + section_num + designCode + 
    qualityCodeDscr + ConstCode + builtYear + CompCode + EffectiveYear + 
    LNbsmtSF + bsmtType + carStorageSF + nbrBedRoom + nbrRoomsNobath + 
    LNmainfloorSF + LNnbrThreeQtrBaths + LNnbrFullBaths + LNnbrHalfBaths + 
    LNTotalFinishedSF + AcDscr + Heating + ExtWallPrim + ExtWallDscrSec + 
    IntWall + Roof_CoverDscr + high_school + elmtr_school + hsch_distance + 
    city  + parkCount + LNcrime_nn1 + LNcrime_nn2 + LNcrime_nn3 + 
    LNcrime_nn4 + LNcrime_nn5 + in_hist*BoulderCity + LNtrailheadCount + elmtr_distance,
    data = st_drop_geometry(sale.training),
    method = "lm", trControl = fitControl, na.action = na.pass)

model 6 - variables selected by stepwise regression

reg.training.6 <- lm(LNprice ~ year_quarter + bld_num + section_num + designCode + 
    qualityCodeDscr + ConstCode + builtYear + CompCode + EffectiveYear + 
    LNbsmtSF + bsmtType + carStorageSF + nbrBedRoom + nbrRoomsNobath + 
    LNmainfloorSF + LNnbrThreeQtrBaths + LNnbrFullBaths + LNnbrHalfBaths + 
    LNTotalFinishedSF + AcDscr + Heating + ExtWallPrim + ExtWallDscrSec + 
    IntWall + Roof_CoverDscr + high_school + elmtr_school + hsch_distance + 
    city + GEOID + parkCount + LNcrime_nn1 + LNcrime_nn2 + LNcrime_nn3 + 
    LNcrime_nn4 + LNcrime_nn5 + in_hist*BoulderCity + LNtrailheadCount, 
    data = st_drop_geometry(sale.training))

Regression summary results

# Append labeled model-fit statistics below the coefficient estimates
sumResults <- summary(reg.training.6)
sumResults$coefficients %>%
  data.frame() %>%
  dplyr::select(Estimate) %>%
  rename(c("price" = "Estimate")) %>%
  rbind("R-squared" = sumResults$r.squared,
        "Adjusted R-squared" = sumResults$adj.r.squared,
        "F-statistic" = sumResults$fstatistic[1]) %>%
  kable() %>%
    kable_styling() %>%
    footnote(general_title = "Regression Summary",
             general = " 5")
price
(Intercept) 7.4598627
year_quarter2019-2 0.0335817
year_quarter2019-3 0.0453795
year_quarter2019-4 0.0367003
year_quarter2020-1 0.0458373
year_quarter2020-2 0.0693606
year_quarter2020-3 0.1005017
year_quarter2020-4 0.1307702
year_quarter2021-1 0.2126215
year_quarter2021-2 0.3018166
bld_num -0.1657598
section_num2 1.0293522
section_num3 -0.6704172
section_num5 1.8158169
section_num7 3.0355786
section_num9 3.7438408
designCode0020 -0.0196481
designCode0040 0.0295933
designCode0050 0.0113024
designCode0120 -0.2209734
qualityCodeDscrAVERAGE + 0.0109581
qualityCodeDscrAVERAGE ++ 0.0354664
qualityCodeDscrEXCELLENT 0.6062930
qualityCodeDscrEXCELLENT + 0.6686337
qualityCodeDscrEXCELLENT++ 0.9027766
qualityCodeDscrEXCEPTIONAL 1 0.6315662
qualityCodeDscrEXCEPTIONAL 2 0.9061082
qualityCodeDscrFAIR -0.2190227
qualityCodeDscrGOOD 0.1165520
qualityCodeDscrGOOD + 0.1629091
qualityCodeDscrGOOD ++ 0.2341460
qualityCodeDscrLOW -0.3843010
qualityCodeDscrVERY GOOD 0.3070539
qualityCodeDscrVERY GOOD + 0.4504580
qualityCodeDscrVERY GOOD ++ 0.4719161
ConstCode300 -0.3853752
ConstCode310 0.2074687
ConstCode320 0.2063986
ConstCode330 0.1921456
ConstCode340 0.4058262
ConstCode370 -0.1086034
ConstCode410 -0.3059814
ConstCode420 -0.1226310
builtYear -0.0015143
CompCode 0.3847230
EffectiveYear 0.0037871
LNbsmtSF 0.0091530
bsmtTypeBGF 0.0167073
bsmtTypeBGU -0.0001695
bsmtTypeBSF -0.0132948
bsmtTypeBSU -0.0119250
bsmtTypeBWF 0.0005649
bsmtTypeBWU 0.0188644
bsmtTypeLGF -0.0015077
bsmtTypeLGU -0.3336202
bsmtTypeLWF -0.0012106
bsmtTypeLWU 0.0263293
carStorageSF 0.0000980
nbrBedRoom 0.0134264
nbrRoomsNobath 0.0102794
LNmainfloorSF 0.0429073
LNnbrThreeQtrBaths 0.0612289
LNnbrFullBaths 0.0882605
LNnbrHalfBaths 0.0423484
LNTotalFinishedSF 0.1762401
AcDscrAttic Fan 0.0030404
AcDscrEvaporative Cooler -0.0151701
AcDscrWhole House 0.0308116
Heating800 -0.3827337
Heating810 -0.6011830
Heating820 -0.5567802
Heating830 -0.5717347
Heating850 -0.5059535
Heating880 -0.6228230
Heating890 -0.6781688
Heating940 -0.5218799
Heating960 -1.7215423
Heating980 -0.7179403
ExtWallPrim10 -0.1635635
ExtWallPrim100 -0.1415928
ExtWallPrim110 -0.0401977
ExtWallPrim120 -0.1432925
ExtWallPrim130 -0.1494365
ExtWallPrim140 -0.1843681
ExtWallPrim20 -0.1136798
ExtWallPrim25 0.0344563
ExtWallPrim30 -0.1500496
ExtWallPrim40 -0.1550581
ExtWallPrim50 -0.0802653
ExtWallPrim60 -0.1391395
ExtWallPrim70 -0.1374087
ExtWallPrim80 -0.0507622
ExtWallPrim90 -0.1901263
ExtWallDscrSecBlock Stucco 0.4817747
ExtWallDscrSecBrick on Block -0.0206207
ExtWallDscrSecBrick Veneer -0.0131963
ExtWallDscrSecCedar 0.1362824
ExtWallDscrSecCement Board -0.0093270
ExtWallDscrSecFaux Stone 0.0268922
ExtWallDscrSecFrame Asbestos 0.0043526
ExtWallDscrSecFrame Stucco -0.0175564
ExtWallDscrSecFrame Wood/Shake -0.0171062
ExtWallDscrSecLog 0.0749300
ExtWallDscrSecMetal 0.0507799
ExtWallDscrSecMoss Rock/Flagstone 0.0658412
ExtWallDscrSecPainted Block 0.3197028
ExtWallDscrSecVinyl -0.0645936
IntWall1120 0.1250987
IntWall1130 0.0563802
IntWall1140 0.0521615
IntWall1150 -0.0003291
IntWall1160 0.0528701
Roof_CoverDscrAsphalt -0.0031834
Roof_CoverDscrBuilt-Up -0.0522176
Roof_CoverDscrClay Tile -0.0540794
Roof_CoverDscrConcrete Tile -0.0295109
Roof_CoverDscrMetal 0.0175203
Roof_CoverDscrRoll -0.4109904
Roof_CoverDscrRubber Membrane 0.0846551
Roof_CoverDscrShake -0.0932849
Roof_CoverDscrTar and Gravel 0.1729453
high_schoolBOULDER HIGH SCHOOL -0.5602365
high_schoolBROOMFIELD HIGH SCHOOL -0.5231827
high_schoolCENTAURUS HIGH SCHOOL -0.5179069
high_schoolERIE HIGH SCHOOL -0.7799698
high_schoolESTES PARK HIGH SCHOOL -0.2524986
high_schoolFAIRVIEW HIGH SCHOOL -0.6632635
high_schoolLONGMONT HIGH SCHOOL -0.1489889
high_schoolLYONS MIDDLE/SENIOR HIGH SCHOOL -0.3022848
high_schoolMEAD HIGH SCHOOL -0.0876481
high_schoolMONARCH HIGH SCHOOL -0.3569797
high_schoolNEDERLAND MIDDLE-SENIOR HIGH SCHOOL -0.1497729
high_schoolNIWOT HIGH SCHOOL -0.6537956
high_schoolSILVER CREEK HIGH SCHOOL -0.6212390
high_schoolSKYLINE HIGH SCHOOL -1.4330063
elmtr_schoolALPINE ELEMENTARY SCHOOL 0.6888960
elmtr_schoolBEAR CREEK ELEMENTARY SCHOOL 0.7197159
elmtr_schoolBLUE MOUNTAIN ELEMENTARY 0.5014399
elmtr_schoolBURLINGTON ELEMENTARY SCHOOL 0.2947118
elmtr_schoolCENTRAL ELEMENTARY SCHOOL -0.4646771
elmtr_schoolCOAL CREEK ELEMENTARY SCHOOL -0.3907149
elmtr_schoolCOLUMBINE ELEMENTARY SCHOOL 0.5330728
elmtr_schoolCREEKSIDE ELEMENTARY SCHOOL AT MARTIN PARK 0.7186825
elmtr_schoolCREST VIEW ELEMENTARY SCHOOL 0.6168674
elmtr_schoolDOUGLASS ELEMENTARY SCHOOL -0.0204279
elmtr_schoolEAGLE CREST ELEMENTARY SCHOOL 0.5247008
elmtr_schoolEISENHOWER ELEMENTARY SCHOOL -0.1310886
elmtr_schoolELDORADO K-8 SCHOOL -0.0587786
elmtr_schoolERIE ELEMENTARY SCHOOL 0.0493967
elmtr_schoolFALL RIVER ELEMENTARY SCHOOL 0.6622773
elmtr_schoolFIRESIDE ELEMENTARY SCHOOL -0.5165781
elmtr_schoolFLATIRONS ELEMENTARY SCHOOL 0.5615314
elmtr_schoolFOOTHILL ELEMENTARY SCHOOL 0.5820641
elmtr_schoolGOLD HILL ELEMENTARY SCHOOL 0.7573266
elmtr_schoolHEATHERWOOD ELEMENTARY SCHOOL -0.0739014
elmtr_schoolHYGIENE ELEMENTARY SCHOOL -0.0861661
elmtr_schoolINDIAN PEAKS ELEMENTARY SCHOOL 0.4151942
elmtr_schoolJAMESTOWN ELEMENTARY SCHOOL 0.4896574
elmtr_schoolLAFAYETTE ELEMENTARY SCHOOL -0.2946388
elmtr_schoolLONGMONT ESTATES ELEMENTARY SCHOOL 0.5131170
elmtr_schoolLOUISVILLE ELEMENTARY SCHOOL -0.4632508
elmtr_schoolMEADOWLARK SCHOOL -0.2808503
elmtr_schoolMESA ELEMENTARY SCHOOL 0.7212740
elmtr_schoolMONARCH K-8 SCHOOL -0.4442214
elmtr_schoolMOUNTAIN VIEW ELEMENTARY SCHOOL -0.5041056
elmtr_schoolNORTHRIDGE ELEMENTARY SCHOOL 0.0467998
elmtr_schoolROCKY MOUNTAIN ELEMENTARY SCHOOL 0.5782444
elmtr_schoolRYAN ELEMENTARY SCHOOL -0.3548475
elmtr_schoolSUPERIOR ELEMENTARY SCHOOL -0.4573650
elmtr_schoolTIMBERLINE PK-8 0.6305236
elmtr_schoolWHITTIER ELEMENTARY SCHOOL 0.7232478
hsch_distance 0.0000068
cityBOULDER -0.1362899
cityERIE -0.6359248
cityJAMESTOWN -0.0798233
cityLAFAYETTE -0.3917227
cityLONGMONT -0.3198007
cityLOUISVILLE -0.3704863
cityLYONS -0.2954004
cityNA -0.1987259
cityNEDERLAND -0.2331567
citySUPERIOR -0.4542554
cityUNINCORPORATED -0.1469000
cityWARD -0.0773527
GEOID080130121012 0.0285272
GEOID080130121013 0.0557880
GEOID080130121014 0.0752840
GEOID080130121021 -0.3821941
GEOID080130121022 -0.1080845
GEOID080130121023 -0.1925028
GEOID080130121024 -0.1482172
GEOID080130121025 -0.1087756
GEOID080130121031 -0.1311942
GEOID080130121032 -0.2116225
GEOID080130121033 -0.3437856
GEOID080130121041 -0.2436681
GEOID080130121042 -0.2119922
GEOID080130121051 -0.5556063
GEOID080130121052 -0.3994996
GEOID080130121053 -0.2440347
GEOID080130121054 -0.2522564
GEOID080130122011 -0.1730375
GEOID080130122012 -0.0158474
GEOID080130122013 0.0396698
GEOID080130122021 -0.1564594
GEOID080130122022 -0.0293721
GEOID080130122023 -0.2568376
GEOID080130122024 -0.4728284
GEOID080130122031 -0.5133696
GEOID080130122032 -0.4345709
GEOID080130122033 -0.5327349
GEOID080130122034 -0.4155949
GEOID080130122041 -0.0538505
GEOID080130122042 -0.1767728
GEOID080130122043 0.1313761
GEOID080130123001 0.1607723
GEOID080130123002 -0.4201052
GEOID080130124011 0.0354901
GEOID080130124012 -0.1460257
GEOID080130124013 -0.1049876
GEOID080130124014 -0.1458284
GEOID080130125011 0.3217954
GEOID080130125012 0.4349272
GEOID080130125051 -0.4167128
GEOID080130125052 -0.1776487
GEOID080130125053 -0.1653207
GEOID080130125071 -0.4916245
GEOID080130125072 -0.4788213
GEOID080130125073 -0.3937267
GEOID080130125081 -0.4076188
GEOID080130125082 -0.4817384
GEOID080130125083 -0.4422143
GEOID080130125091 -0.4493696
GEOID080130125092 -0.2947998
GEOID080130125093 -0.2437875
GEOID080130125101 -0.1141773
GEOID080130125102 -0.1742191
GEOID080130125103 -0.1327010
GEOID080130125104 -0.1404226
GEOID080130125111 0.3541047
GEOID080130125112 0.3738362
GEOID080130126031 0.4011392
GEOID080130126032 0.5155477
GEOID080130126052 -0.4575001
GEOID080130126074 -0.3950531
GEOID080130126081 -0.4698919
GEOID080130127011 -0.2415811
GEOID080130127012 -0.4992194
GEOID080130127013 -0.4474991
GEOID080130127014 -0.5761053
GEOID080130127051 -0.0479544
GEOID080130127052 -0.6987927
GEOID080130127053 -0.6513423
GEOID080130127071 -0.5467711
GEOID080130127072 0.3618107
GEOID080130127081 0.0272080
GEOID080130127082 0.0399416
GEOID080130127083 0.0170129
GEOID080130127084 -0.0943914
GEOID080130127091 -0.0309247
GEOID080130127101 0.4701669
GEOID080130127102 0.3457681
GEOID080130127103 0.4063581
GEOID080130128001 0.3370465
GEOID080130128002 0.2915309
GEOID080130128003 0.3613284
GEOID080130128004 0.3745812
GEOID080130128005 0.2896791
GEOID080130129031 0.3366830
GEOID080130129041 0.3411790
GEOID080130129042 0.3572986
GEOID080130129051 0.3957092
GEOID080130129052 0.3918470
GEOID080130129071 0.0166751
GEOID080130129072 0.0295085
GEOID080130130031 0.1416548
GEOID080130130032 0.4316856
GEOID080130130033 0.3940933
GEOID080130130034 0.3956849
GEOID080130130041 0.4415588
GEOID080130130042 0.4432115
GEOID080130130051 0.5418941
GEOID080130130052 0.5837754
GEOID080130130061 0.6411419
GEOID080130130062 0.5620311
GEOID080130130063 0.5299065
GEOID080130132011 -0.5228961
GEOID080130132021 -0.2255155
GEOID080130132022 -0.0513867
GEOID080130132051 0.1339036
GEOID080130132052 0.1183008
GEOID080130132053 -0.0456595
GEOID080130132071 -0.5136689
GEOID080130132073 -0.5554002
GEOID080130132081 -0.7020884
GEOID080130132082 -0.6606669
GEOID080130132083 -0.6518799
GEOID080130132101 -0.4834358
GEOID080130132102 -0.5308703
GEOID080130132103 -0.4473591
GEOID080130132111 -0.3631865
GEOID080130132112 -0.4323950
GEOID080130132113 -0.6366977
GEOID080130132121 -0.6154950
GEOID080130132122 -0.5356452
GEOID080130132123 -0.5126049
GEOID080130132131 -0.5207211
GEOID080130132132 -0.5992122
GEOID080130132133 -0.4973080
GEOID080130133021 0.1553242
GEOID080130133022 -0.0837941
GEOID080130133023 -0.0341075
GEOID080130133024 0.0078288
GEOID080130133051 -0.6641496
GEOID080130133052 -0.6025627
GEOID080130133053 -0.6470284
GEOID080130133061 -0.6317135
GEOID080130133062 -0.6189368
GEOID080130133071 -0.0771260
GEOID080130133072 -0.1049356
GEOID080130133081 -0.0674644
GEOID080130133082 -0.0531330
GEOID080130134011 0.0241918
GEOID080130134012 0.1521335
GEOID080130134013 0.1401861
GEOID080130134021 -0.0948260
GEOID080130134022 -0.0532732
GEOID080130134023 0.0893205
GEOID080130134024 -0.0069168
GEOID080130134025 0.1180879
GEOID080130135031 0.0313662
GEOID080130135032 -0.0242666
GEOID080130135033 0.1701260
GEOID080130135034 0.1818097
GEOID080130135051 0.0087118
GEOID080130135052 0.0874011
GEOID080130135053 0.0079358
GEOID080130135061 -0.0754728
GEOID080130135071 0.0125322
GEOID080130135081 0.0205186
GEOID080130135082 -0.0013412
GEOID080130136011 -0.3713145
GEOID080130136012 -0.4200530
GEOID080130136013 -0.5937147
GEOID080130136021 -0.6590060
GEOID080130136022 -0.8158686
GEOID080130137011 -0.4934858
GEOID080130137012 -0.3502156
GEOID080130137013 -0.2859969
GEOID080130137014 -0.7870444
GEOID080130137021 -0.4416028
GEOID080130137022 -0.4646220
GEOID080130137023 -0.4439887
GEOID080130137024 -0.6885481
GEOID080130137025 -0.4365613
GEOID080130137026 -0.4803994
GEOID080130137027 -0.4952721
GEOID080130606001 0.2592526
GEOID080130606002 0.4391111
GEOID080130607001 0.4196943
GEOID080130607002 0.3987227
GEOID080130607003 0.4035943
GEOID080130608002 0.0565793
GEOID080130608003 0.3178374
GEOID080130608004 0.1262724
GEOID080130609001 0.4082413
GEOID080130609002 0.3681193
GEOID080130613001 0.3796318
GEOID080130613002 0.4053398
parkCount 0.0060073
LNcrime_nn1 0.0397733
LNcrime_nn2 -0.1710341
LNcrime_nn3 0.4747803
LNcrime_nn4 -0.6680690
LNcrime_nn5 0.3527733
in_histYes -0.1143257
LNtrailheadCount 0.0357702
in_histYes:BoulderCityNotIn 0.0976393
R-squared           0.8123423
Adjusted R-squared  0.8053154
F-statistic         115.6044832
Regression Summary (Table 5)

MAE & MAPE

#MAPE

reg.test6 <-
  sale.test %>%
  mutate(price.Predict = exp(predict(reg.training.6, sale.test))-1) %>% 
  na.omit(cols='price.Predict') %>% 
  mutate(price.Error = price.Predict - price,
         price.AbsError = abs(price.Predict - price),
         price.APE = (abs(price.Predict - price)) / price.Predict,
         Regression = "Neighborhood Effects")
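The `exp(...) - 1` back-transformation assumes `LNprice` was built as `log(price + 1)`; a quick round trip on toy prices confirms the two are inverses (base R's `log1p()`/`expm1()` are numerically safer equivalents):

```r
# Toy prices (not from the data): log(price + 1), then exp(.) - 1, returns dollars
price <- c(250000, 500000, 750000)
stopifnot(isTRUE(all.equal(exp(log(price + 1)) - 1, price)))
stopifnot(isTRUE(all.equal(expm1(log1p(price)), price)))
```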

# Polished table of MAE & MAPE
error. <- data.frame(MAE = mean(reg.test6$price.AbsError, na.rm = T), MAPE = mean(reg.test6$price.APE, na.rm = T))

kable(error.) %>%
  kable_styling() %>%
  footnote(general_title = "\n",
           general = " 6")
MAE        MAPE
48745.49   0.088194

(Table 6)

cross validation

fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

# cross-validation
reg.cv.6 <- 
  train(LNprice ~ year_quarter + bld_num + section_num + designCode + 
    qualityCodeDscr + ConstCode + builtYear + CompCode + EffectiveYear + 
    LNbsmtSF + bsmtType + carStorageSF + nbrBedRoom + nbrRoomsNobath + 
    LNmainfloorSF + LNnbrThreeQtrBaths + LNnbrFullBaths + LNnbrHalfBaths + 
    LNTotalFinishedSF + AcDscr + Heating + ExtWallPrim + ExtWallDscrSec + 
    IntWall + Roof_CoverDscr + high_school + elmtr_school + hsch_distance + 
    city + GEOID + parkCount + LNcrime_nn1 + LNcrime_nn2 + LNcrime_nn3 + 
    LNcrime_nn4 + LNcrime_nn5 + in_hist*BoulderCity + LNtrailheadCount, 
    data = st_drop_geometry(sale.training),
    method = "lm", trControl = fitControl, na.action = na.pass)

# Polished table of cross-validation RMSE, R-squared & MAE

kable(reg.cv.6$resample %>%  summarise(RMSE = mean(RMSE), Rsquared = mean(Rsquared), MAE = mean(MAE))) %>%
  kable_styling() %>%
  footnote(general_title = "\n",
           general = " 7")
RMSE       Rsquared   MAE
0.2401742  0.7898603  0.1348404

(Table 7)
#kable(reg.cv.6) %>%
#  kable_styling() %>%
#  footnote(general_title = "\n",
#           general = " 7")

Maps & plots of all predicted values

predicted prices as a function of observed prices

#Plot Predicted Prices as f(x) of observed prices
reg.test6 %>%
    dplyr::select(price.Predict, price) %>%
    ggplot(aes(price, price.Predict)) +
    geom_point() +
    stat_smooth(aes(price, price),
                method = "lm", se = FALSE, size = 1.5, colour=palette6[1]) +
    stat_smooth(aes(price.Predict, price), 
                method = "lm", se = FALSE, size = 1.5, colour=palette6[2]) +
    labs(title="Predicted sale price as a function of observed price",
         subtitle="Orange line represents a perfect prediction; green line represents the model's prediction") +
    plotTheme()

predicted values maps

# Predict price for all
sale.predict <- sale.final %>%
  mutate(price.Predict = exp(predict(reg.training.6, sale.final))-1,
         Regression = "Neighborhood Effects") %>% 
  na.omit(cols='price.Predict')

#Map of Dependent Variable - Sale Price
ggplot() +
  geom_sf(data = census, fill = "grey40", size=1) +
  geom_sf(data = sale.predict , aes(color = q5(price.Predict)),size=.5) +
  scale_color_manual(values = palette7[1:5],
                       labels = qBr(sale.predict, "price.Predict",0),
                       name = "Predict Price\n(Quintile Breaks)") +
  labs(title = "Predicted Sales Prices", subtitle = "Boulder County, CO") +
  mapTheme()

Spatial Evaluation

# Create spatial weights
coords.test <-  st_coordinates(reg.test6) 
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(neighborList.test, style="W")
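The weights built above make each home's spatial lag the mean of its five nearest neighbors. A minimal sketch on made-up points (two neighbors for brevity) shows what `lag.listw()` returns:

```r
library(spdep)

# Six made-up points along a line; value of each point = 10 * its x coordinate
xy   <- cbind(1:6, rep(0, 6))
vals <- c(10, 20, 30, 40, 50, 60)

nb <- knn2nb(knearneigh(xy, k = 2))  # each point's 2 nearest neighbors
w  <- nb2listw(nb, style = "W")      # row-standardized (i.e. averaging) weights

lag.listw(w, vals)  # mean of each point's 2 neighbors: 25 20 30 40 50 45
```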

Spatial Cluster of Errors

Absolute error Map by neighborhood

ggplot() +
  geom_sf(data = census, fill = "grey40", size=1) +
  geom_sf(data = reg.test6 , aes(color = q5(price.AbsError)), size = .8) +
  scale_color_manual(values = palette7[1:5],
                       labels = qBr(reg.test6, "price.AbsError",0),
                       name = "Model Errors\n(Quintile Breaks)") +
  labs(title = "Sale price absolute errors on the test set", subtitle = "Boulder County, CO") +
  mapTheme()

Error as a function of the spatial lag of errors

Cluster of errors - The Spatial Lag

# Plot errors against the spatial lag of errors
reg.test6 %>% 
  mutate(lagPriceError = lag.listw(spatialWeights.test, price.Error)) %>%
  ggplot(aes(lagPriceError, price.Error))+
  geom_point() +
  stat_smooth(aes(lagPriceError, price.Error), 
             method = "lm", se = FALSE, size = 2, colour=palette5[10]) + 
  labs(title = "Error as a function of the spatial lag of errors",
       x = "Spatial lag of errors (Mean error of 5 nearest neighbors)",
       y = "Price errors") +
  plotTheme()

Moran’s I test

Cluster of errors - Moran’s I test

# Do Moran test
moranTest <- moran.mc(reg.test6$price.Error %>% na.omit(), 
                      spatialWeights.test, nsim = 999)

# Plot the test result
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(fill = palette5[1],binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = palette5[10],size=2) +
  scale_x_continuous(limits = c(-0.2, 0.2)) +
  labs(title="Observed and Permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count") +
  plotTheme()

MAPE in Space

MAPE map by neighborhood

# combine the baseline regression with the neighborhood effects regression
bothRegressions <- 
  rbind(
    dplyr::select(reg.test5, starts_with("price"), Regression, GEOID) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, price.Error)),
    dplyr::select(reg.test6, starts_with("price"), Regression, GEOID) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, price.Error)))  

# map
st_drop_geometry(bothRegressions) %>%
  group_by(Regression, GEOID) %>%
  summarize(mean.MAPE = mean(price.APE, na.rm = T)) %>%
  ungroup() %>% 
  left_join(census) %>%
    st_sf() %>%
    ggplot() + 
  geom_sf(data = census, fill = "#F5F5F5", size=.5) +
  geom_sf(aes(fill = mean.MAPE), size=.5) +
  geom_sf(data = bothRegressions, colour = "grey10", size = .7) +
  facet_wrap(~Regression,ncol = 1) +
  scale_fill_gradient(low = palette5[1], high = palette5[20],
                      name = "MAPE") +
  labs(title = "Mean test set MAPE by neighborhood") +
  mapTheme()

MAPE by neighborhood as a function of mean price by neighborhood

reg.test6 %>% 
  group_by(GEOID) %>% 
  summarise(MAPE = mean(price.APE), MeanPrice = mean(price)) %>% 
  st_drop_geometry() %>% 
  ggplot(aes(MeanPrice,MAPE))+
  geom_point() +
  stat_smooth(aes(MeanPrice,MAPE), 
            method = "lm", se = FALSE, size = 2, colour=palette5[10]) + 
  labs(title = "MAPE by neighborhood as a function of mean price by neighborhood",
       x = "Mean sale Price by neighborhood",
       y = "Mean absolute percentage error by neighborhood") +
  plotTheme()

Accounting for neighborhoods

Generalizability by census race, income and unemployment

tracts17 <- 
  get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E","B06011_001E","B23025_003E","B23025_005E"), 
          year = 2017, state = "08", county = "013", geometry = T, output = "wide") %>%  # Boulder County, CO (FIPS 08-013)
  st_transform('ESRI:102286')  %>%
  rename(TotalPop = B01001_001E,
         NumberWhites = B01001A_001E,
         Median_Income = B06011_001E,
         TotalLabor = B23025_003E, 
         Unemployment = B23025_005E) %>%
  mutate(percentWhite = NumberWhites / TotalPop,
         UnemploymentRate = Unemployment / TotalLabor,
         raceContext = ifelse(percentWhite > .5, "Majority White", "Majority Non-White"),
         incomeContext = ifelse(Median_Income > 32322, "High Income", "Low Income"),
         UnemploymentContext = ifelse(UnemploymentRate > 0.03757467, "High Unemployment", "Low Unemployment"))

grid.arrange(ncol = 3,
  ggplot() + geom_sf(data = na.omit(tracts17), aes(fill = raceContext),color=palette5[1],size=.5) +
    scale_fill_manual(values = c(palette6[2],palette6[1]), name="Race Context") +
    labs(title = "Race Context") +
    mapTheme() + theme(legend.position="bottom"), 
  ggplot() + geom_sf(data = na.omit(tracts17), aes(fill = incomeContext),color=palette5[1],size=.5) +
    scale_fill_manual(values = c(palette6[2],palette6[1]), name="Income Context") +
    labs(title = "Income Context") +
    mapTheme() + theme(legend.position="bottom"),
  ggplot() + geom_sf(data = na.omit(tracts17), aes(fill = UnemploymentContext),color=palette5[1],size=.5) +
    scale_fill_manual(values = c(palette6[2],palette6[1]), name="Unemployment Context") +
    labs(title = "Unemployment Context") +
    mapTheme() + theme(legend.position="bottom"))

Discussion

While data science can never achieve a perfect model, the datasets and methods used here were, overall, very effective. Two datasets of particular interest were crime, mainly because it was difficult to obtain, and historic districts, which, to our surprise, had an inverse effect on house prices. Another of our more interesting variables was heating (one of the variables inherent in the house data), as it had a larger than expected impact on the model. We were able to predict ~80% of the variation in home sale prices. Our neighborhood data, along with a few outstanding high schools and elementary schools, turned out to be among our most important features. We observed an error rate of ~13% in the generalizability of the model, likely due to spatial variables not included in the model. According to our maps, we were able to account for much of the spatial variation in prices. Our model predicted particularly well in the eastern section of the county, with one small neighborhood in the City of Boulder as the exception. Due to the dearth of sale data in the western county, we lack test data for that area, which means prediction accuracy there is less ascertainable, although the cross-validation results support the generalizability of our predictions.

Conclusion

We would absolutely recommend our model to Zillow. We were able to predict ~80% of the variation in home prices and observed an error rate of only ~13% in the generalizability of the model. Physical characteristics, public services, crime, and spatial clustering were all components of this analysis, which together produced a comprehensive model. Additionally, it was created by two Ivy League students, which means it must be good. Our use of census block groups as neighborhoods was not ideal, and the number of observations in the western county was not sufficient. In a future analysis we would prefer to find or create a boundary set that more accurately reflects socioeconomic boundaries. On top of that, training the model with more evenly distributed data would yield higher generalizability.